This notebook is provided as a case of study of Blockchain-Based financial assets, in particular, Bitcoin and Ethereum cryptocurrencies. Data within this notebook was extracted from the Kraken exchange and making use of the ccxt library.
In order to run this notebook, it is necessary to have installed and/or have the requirements.txt file with the following:
The following are the file dependencies that are needed to run this notebook:
%%capture
# Install all the pip packages in the requirements.txt
import sys
!{sys.executable} -m pip install -r requirements.txt
# -- Generic
import pandas as pd
import numpy as np
import json
# -- Visualization
import plotly.graph_objects as go
# -- Project scripts
import visualizations as vs
import functions as fn
Historical Orderbook data is stored in a JSON formated file, which stands for Java Script Object Notation, in practical terms, it is very similar to a python dictionary since it has a list-like structure of storing the data, with one or several keys and each of which can have more keys and or data points.
Remember also that a Pandas DataFrame is also in its core form a dictionary where each colum is a key and the column contents are each key contents. So, in a way, a JSON file is one option to represent a multidimensional DataFrame.
To read the contents of the file is very simple, by using the library json, it is a pre-installed one and you do need to import it but its already installed in the python base installation.
# Opening JSON file
f = open('files/orderbooks_05jul21.json')
# Returns JSON object as a dictionary
orderbooks_data = json.load(f)
Lets check the contents of the file.
orderbooks_data.keys()
dict_keys(['bitfinex', 'kraken'])
Allright, we have 2 keys, each of which corresponds to different exchanges where BTC/USDT is listed. Lets get store the timestamps of the two in sepparate objects, we will need those for later calculations.
bitfinex_ts = list(orderbooks_data['bitfinex'].keys())
kraken_ts = list(orderbooks_data['kraken'].keys())
Now lets see some other info of each of the list, in priciple we are interested in just one for now so this calculation can apply to any of those two. Lets use as an example the data we have from Bitfinex.
This two exchange timestamps comparisson will definitely be of use for the final project XEMM
Inside the functions file, create a function named f_timestamps_info with input parameteres ts_list_o and ts_list_d, and the output must be a dict type object with the following keys and content:
Place the code for this function in the functions.py file and should be callable from the following cell.
q1_results = fn.f_compare_ts(ts_list_o=bitfinex_ts, ts_list_d=kraken_ts)
To practice the use of display function on jupyter notebooks, you should delivery the exact same result
# Delete this comment and place your code here in this cell
'Timestamps in Origin data:'
'First Timestamp: 2021-07-05T13:06:46.571Z'
'Last Timestamp: 2021-07-05T14:06:49.417Z'
'Total number of orderbooks: 3602'
'Timestamps in Destination data:'
'First Timestamp: 2021-07-05T13:06:46.571Z'
'Last Timestamp: 2021-07-05T14:06:49.417Z'
'Total number of orderbooks: 3602'
'Exact match of Timestamps: 3602'
'First 2 values are: '
['2021-07-05T13:33:43.414Z', '2021-07-05T14:05:25.495Z']
'Last 2 values are: '
['2021-07-05T13:26:51.068Z', '2021-07-05T13:11:32.913Z']
From now and on for the purpose of this lab, we are going to use only the information of one exchange, and also we are going to have the following order of columns (it is just a convention) 'bid_size', 'bid', 'ask' and 'ask_size'
ob_data = orderbooks_data['kraken']
There exists a posibility of having None or empty values for any timestamp contained in the data set, so for the next sections on this lab we need to use the keys of the dict with not None values.
Next incllude your code that generates the output that is showed.
# Delete this comment and place your code here in this cell
'Number of historical Kraken orderbooks:'
'Before dropping Nones: 3602'
'After dropping Nones: 1201'
Lets see the first timestamp from where we do have data
display('First timestamp: ' + list(ob_data.keys())[0])
print('')
display('The Orderbook data is the following:')
print('')
display(pd.DataFrame(ob_data[list(ob_data.keys())[0]]))
'First timestamp: 2021-07-05T13:06:49.495Z'
'The Orderbook data is the following:'
| ask_size | ask | bid | bid_size | |
|---|---|---|---|---|
| 0 | 3.352 | 28298.4 | 28298.3 | 0.880 |
| 1 | 0.020 | 28300.0 | 28293.9 | 0.350 |
| 2 | 0.074 | 28305.4 | 28293.5 | 0.377 |
| 3 | 0.123 | 28305.5 | 28292.2 | 0.011 |
| 4 | 0.204 | 28307.3 | 28292.0 | 1.221 |
| ... | ... | ... | ... | ... |
| 95 | 0.001 | 28430.4 | 28179.1 | 3.726 |
| 96 | 0.001 | 28431.0 | 28178.9 | 3.059 |
| 97 | 3.840 | 28431.8 | 28169.6 | 3.733 |
| 98 | 3.032 | 28432.1 | 28168.2 | 2.756 |
| 99 | 4.470 | 28446.7 | 28167.8 | 2.498 |
100 rows × 4 columns
... And what about the last timestamp we have data from?
display('Last timestamp: ' + list(ob_data.keys())[-1])
print('')
display('The Orderbook data is the following:')
print('')
display(pd.DataFrame(ob_data[list(ob_data.keys())[-1]]))
'Last timestamp: 2021-07-05T14:06:49.417Z'
'The Orderbook data is the following:'
| ask_size | ask | bid | bid_size | |
|---|---|---|---|---|
| 0 | 0.810 | 28396.9 | 28396.8 | 0.001 |
| 1 | 0.743 | 28397.9 | 28389.7 | 0.016 |
| 2 | 0.005 | 28400.9 | 28389.3 | 0.165 |
| 3 | 0.190 | 28402.7 | 28388.1 | 0.001 |
| 4 | 0.002 | 28402.9 | 28386.8 | 0.550 |
| ... | ... | ... | ... | ... |
| 95 | 0.113 | 28512.9 | 28286.4 | 3.886 |
| 96 | 0.063 | 28513.1 | 28282.4 | 9.627 |
| 97 | 3.117 | 28514.2 | 28279.9 | 3.693 |
| 98 | 3.000 | 28517.0 | 28279.8 | 0.001 |
| 99 | 0.001 | 28517.6 | 28279.6 | 4.396 |
100 rows × 4 columns
All right, looks good. We just need to switch the columns in order as its more often used in industry. After that, we should make a plot to have a different visual representation of that data, lets use plotly
ob_data = {orderbook: pd.DataFrame(ob_data[orderbook])[['bid_size', 'bid', 'ask', 'ask_size']]
for orderbook in list(ob_data.keys())}
ob_data[list(ob_data.keys())[0]]
| bid_size | bid | ask | ask_size | |
|---|---|---|---|---|
| 0 | 0.880 | 28298.3 | 28298.4 | 3.352 |
| 1 | 0.350 | 28293.9 | 28300.0 | 0.020 |
| 2 | 0.377 | 28293.5 | 28305.4 | 0.074 |
| 3 | 0.011 | 28292.2 | 28305.5 | 0.123 |
| 4 | 1.221 | 28292.0 | 28307.3 | 0.204 |
| ... | ... | ... | ... | ... |
| 95 | 3.726 | 28179.1 | 28430.4 | 0.001 |
| 96 | 3.059 | 28178.9 | 28431.0 | 0.001 |
| 97 | 3.733 | 28169.6 | 28431.8 | 3.840 |
| 98 | 2.756 | 28168.2 | 28432.1 | 3.032 |
| 99 | 2.498 | 28167.8 | 28446.7 | 4.470 |
100 rows × 4 columns
Inside the visualizations.py file, there is a function named plot_orderbook with input parameteres ob_data_plot, using that function, you need to first select the day that had the highest total volume in the orderbook, and then use it to plot the first 20 levels for Bid side and first 20 levels for Ask side.
Note: If you find more than 1 timestamp with the exact same volume, use the one that is closest to the present.
# Delete this comment and place your code here in this cell
Now we need to calculate the historical bid, ask and spread. That means to calculate them using the Top of the Book, i.e. using only the best bid, the bes ask. So, it turns out we need to iterate over all orderbook data we have and calculate those values.
So, create a DataFrame, df_ts_tob with the following contents:
# Delete this comment and place your code here in this cell
| timestamp | bid | ask | spread | |
|---|---|---|---|---|
| 0 | 2021-07-05T13:06:49.495Z | 28298.3 | 28298.4 | 0.1 |
| 1 | 2021-07-05T13:06:52.415Z | 28298.3 | 28298.4 | 0.1 |
| 2 | 2021-07-05T13:06:55.500Z | 28309.9 | 28310.0 | 0.1 |
| 3 | 2021-07-05T13:06:58.414Z | 28309.9 | 28310.0 | 0.1 |
| 4 | 2021-07-05T13:07:01.503Z | 28319.9 | 28320.0 | 0.1 |
| ... | ... | ... | ... | ... |
| 1196 | 2021-07-05T14:06:37.413Z | 28402.8 | 28402.9 | 0.1 |
| 1197 | 2021-07-05T14:06:40.499Z | 28396.8 | 28396.9 | 0.1 |
| 1198 | 2021-07-05T14:06:43.413Z | 28396.8 | 28396.9 | 0.1 |
| 1199 | 2021-07-05T14:06:46.749Z | 28396.8 | 28396.9 | 0.1 |
| 1200 | 2021-07-05T14:06:49.417Z | 28396.8 | 28396.9 | 0.1 |
1201 rows × 4 columns
In order to see some interesting time periods, lets see if we can find the timestamps where spread was above to 8.0 Usd.
# Delete this comment and place your code here in this cell
| timestamp | bid | ask | spread | |
|---|---|---|---|---|
| 32 | 2021-07-05T13:08:25.495Z | 28321.6 | 28332.3 | 10.7 |
| 33 | 2021-07-05T13:08:28.414Z | 28321.6 | 28332.3 | 10.7 |
| 36 | 2021-07-05T13:08:37.500Z | 28326.5 | 28334.6 | 8.1 |
| 37 | 2021-07-05T13:08:40.443Z | 28326.5 | 28334.6 | 8.1 |
| 122 | 2021-07-05T13:12:55.501Z | 28403.1 | 28414.6 | 11.5 |
| 123 | 2021-07-05T13:12:58.421Z | 28403.1 | 28414.6 | 11.5 |
| 124 | 2021-07-05T13:13:01.502Z | 28403.0 | 28414.6 | 11.6 |
| 130 | 2021-07-05T13:13:19.660Z | 28399.5 | 28409.7 | 10.2 |
| 131 | 2021-07-05T13:13:22.421Z | 28399.5 | 28409.7 | 10.2 |
| 183 | 2021-07-05T13:15:58.501Z | 28405.4 | 28413.8 | 8.4 |
| 184 | 2021-07-05T13:16:01.505Z | 28404.3 | 28412.4 | 8.1 |
| 185 | 2021-07-05T13:16:04.416Z | 28404.3 | 28412.4 | 8.1 |
| 223 | 2021-07-05T13:17:58.494Z | 28366.8 | 28375.7 | 8.9 |
| 224 | 2021-07-05T13:18:01.423Z | 28366.8 | 28375.7 | 8.9 |
| 246 | 2021-07-05T13:19:07.500Z | 28340.7 | 28349.2 | 8.5 |
| 437 | 2021-07-05T13:28:40.502Z | 28361.8 | 28370.3 | 8.5 |
| 478 | 2021-07-05T13:30:43.497Z | 28386.4 | 28396.1 | 9.7 |
| 479 | 2021-07-05T13:30:46.415Z | 28386.4 | 28396.1 | 9.7 |
| 662 | 2021-07-05T13:39:55.501Z | 28372.4 | 28381.3 | 8.9 |
| 663 | 2021-07-05T13:39:58.431Z | 28372.4 | 28381.3 | 8.9 |
| 877 | 2021-07-05T13:50:40.682Z | 28431.0 | 28439.6 | 8.6 |
| 878 | 2021-07-05T13:50:43.414Z | 28431.0 | 28439.6 | 8.6 |
Lets visualize how the bid and the ask have been in the whole time series data we have. You have to create a time series plot, using plotly, exactly like this one:
# Delete this comment and place your code here in this cell
We will need to see the spread in time batches, so, lets see if you can split the data into hours and create a boxplot for every group, like the following:
# Delete this comment and place your code here in this cell
Create a function named f_spread_outliers which input parameter is spred_data and its output the quantity of outliers according to the following formula.
$Outliers = mean \pm 1.5 * IQR $
# Delete this comment and place your code here in this cell
Using the pdf document, propose a series of steps to produce a Theoretical spread calculation and compare it with the spread you have from the data above.
# Delete this comment and place your code here in this cell
Using the pdf document, propose a series of steps and codes to test if the mid price is a martingale process
# Delete this comment and place your code here in this cell
You can youse the R package vrtest
[1] Munnoz, 2020. Python project template. https://github.com/iffranciscome/python-project. (2021).